2nd June 2016
apt and yes frequent updatesDavid Robinson summarized the goal on his laptop
Whenever you’re learning a new tool, for a long time you’re going to suck… But the good news is that it's typical, that’s something that happens to everyone, and it’s only temporary. – Hadley Wickham
magrittr by Stefan Milton Bache
Compare
## [1] 0.2147669
## [1] 0.215
with
## [1] 0.215
Easy to read, natural from left to right.
Even better with one instruction per line and indentation
## [1] 0.215
There's a cheatsheet!
## Group.1 x ## 1 4 26.66364 ## 2 6 19.74286 ## 3 8 15.10000
## Source: local data frame [3 x 2] ## ## cyl mean(mpg) ## (dbl) (dbl) ## 1 4 26.66364 ## 2 6 19.74286 ## 3 8 15.10000
step-by-step tidying and manipulating data frames
Convert data to a tbl_df so that it uses dplyr's nice print method. No need for dim()
## Source: local data frame [336,776 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## .. ... ... ... ... ... ... ... ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
Use glimpse to few some values in each column.
## Observations: 336,776 ## Variables: 19 ## $ year (int) 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,... ## $ month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ day (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ dep_time (int) 517, 533, 542, 544, 554, 554, 555, 557, 557, 55... ## $ sched_dep_time (int) 515, 529, 540, 545, 600, 558, 600, 600, 600, 60... ## $ dep_delay (dbl) 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2... ## $ arr_time (int) 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7... ## $ sched_arr_time (int) 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7... ## $ arr_delay (dbl) 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -... ## $ carrier (chr) "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",... ## $ flight (int) 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79... ## $ tailnum (chr) "N14228", "N24211", "N619AA", "N804JB", "N668DN... ## $ origin (chr) "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"... ## $ dest (chr) "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"... ## $ air_time (dbl) 227, 227, 160, 183, 116, 150, 158, 53, 140, 138... ## $ distance (dbl) 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94... ## $ hour (dbl) 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,... ## $ minute (dbl) 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ... ## $ time_hour (time) 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
How many flights flew to La Guardia, NY in 2013? Expecting none…
## Source: local data frame [1 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 7 27 NA 106 NA NA ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
base version (could be subset)
## Source: local data frame [1 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 7 27 NA 106 NA NA ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
How many flights flew to Madison in first week of January?
## Source: local data frame [6 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 1 1353 1357 -4 1549 ## 2 2013 1 2 1422 1357 25 1604 ## 3 2013 1 3 1415 1351 24 1540 ## 4 2013 1 4 1345 1350 -5 1525 ## 5 2013 1 6 1340 1345 -5 1506 ## 6 2013 1 7 1348 1350 -2 1625 ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
Commas in the filter statement are implicit
&(and) operators. Is there anything similar for|(or)?
Logical or statements are supported, but there's no shorthand.
For more complicated checks, try a set operation.
Sort by which airport they departed from in NYC, then year, month, day.
## Source: local data frame [336,776 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 554 558 -4 740 ## 3 2013 1 1 555 600 -5 913 ## 4 2013 1 1 558 600 -2 923 ## 5 2013 1 1 559 600 -1 854 ## 6 2013 1 1 601 600 1 844 ## 7 2013 1 1 606 610 -4 858 ## 8 2013 1 1 607 607 0 858 ## 9 2013 1 1 608 600 8 807 ## 10 2013 1 1 615 615 0 833 ## .. ... ... ... ... ... ... ... ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
Find longest delayed flights to Madison.
## Source: local data frame [572 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 9 12 1841 1350 291 2135 ## 2 2013 12 5 2000 1420 340 2132 ## 3 2013 10 7 1912 1425 287 2048 ## 4 2013 3 8 1907 1405 302 2031 ## 5 2013 3 14 1845 1405 280 2026 ## 6 2013 11 17 45 1935 310 206 ## 7 2013 6 30 1855 1415 280 2013 ## 8 2013 10 11 1857 1425 272 2011 ## 9 2013 2 6 2242 1825 257 15 ## 10 2013 12 31 1819 1505 194 2023 ## .. ... ... ... ... ... ... ... ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
Find the most delayed (in minutes) flight in 2013
## Source: local data frame [336,776 x 19] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 9 641 900 1301 1242 ## 2 2013 6 15 1432 1935 1137 1607 ## 3 2013 1 10 1121 1635 1126 1239 ## 4 2013 9 20 1139 1845 1014 1457 ## 5 2013 7 22 845 1600 1005 1044 ## 6 2013 4 10 1100 1900 960 1342 ## 7 2013 3 17 2321 810 911 135 ## 8 2013 7 22 2257 759 898 121 ## 9 2013 12 5 756 1700 896 1058 ## 10 2013 5 3 1133 2055 878 1250 ## .. ... ... ... ... ... ... ... ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
## [1] 21.2
Select the columns you want.
## Source: local data frame [336,776 x 4] ## ## origin year month day ## (chr) (int) (int) (int) ## 1 EWR 2013 1 1 ## 2 LGA 2013 1 1 ## 3 JFK 2013 1 1 ## 4 JFK 2013 1 1 ## 5 LGA 2013 1 1 ## 6 EWR 2013 1 1 ## 7 EWR 2013 1 1 ## 8 LGA 2013 1 1 ## 9 JFK 2013 1 1 ## 10 LGA 2013 1 1 ## .. ... ... ... ...
select has many helper functions. See ?select.
## Source: local data frame [336,776 x 6] ## ## origin year month day dep_time dep_delay ## (chr) (int) (int) (int) (int) (dbl) ## 1 EWR 2013 1 1 517 2 ## 2 LGA 2013 1 1 533 4 ## 3 JFK 2013 1 1 542 2 ## 4 JFK 2013 1 1 544 -1 ## 5 LGA 2013 1 1 554 -6 ## 6 EWR 2013 1 1 554 -4 ## 7 EWR 2013 1 1 555 -5 ## 8 LGA 2013 1 1 557 -3 ## 9 JFK 2013 1 1 557 -3 ## 10 LGA 2013 1 1 558 -2 ## .. ... ... ... ... ... ...
We can drop columns by "negating" the name. Since helpers give us column names, we can negate them too.
## Source: local data frame [336,776 x 12] ## ## year month day dep_delay carrier flight tailnum origin distance ## (int) (int) (int) (dbl) (chr) (int) (chr) (chr) (dbl) ## 1 2013 1 1 2 UA 1545 N14228 EWR 1400 ## 2 2013 1 1 4 UA 1714 N24211 LGA 1416 ## 3 2013 1 1 2 AA 1141 N619AA JFK 1089 ## 4 2013 1 1 -1 B6 725 N804JB JFK 1576 ## 5 2013 1 1 -6 DL 461 N668DN LGA 762 ## 6 2013 1 1 -4 UA 1696 N39463 EWR 719 ## 7 2013 1 1 -5 B6 507 N516JB EWR 1065 ## 8 2013 1 1 -3 EV 5708 N829AS LGA 229 ## 9 2013 1 1 -3 B6 79 N593JB JFK 944 ## 10 2013 1 1 -2 AA 301 N3ALAA LGA 733 ## .. ... ... ... ... ... ... ... ... ... ## Variables not shown: hour (dbl), minute (dbl), time_hour (time)
tbl_df - nice print method. Now in [tibble](https://github.com/hadley/tibble)glimpse - some of each columnfilter - subsettingarrange - sorting (desc to reverse the sort)select - picking (and omiting) columnsRename columns with rename(NewName = OldName). To keep the order correct, read/remember the renaming = as "was".
## Source: local data frame [336,776 x 19] ## ## y m d dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## .. ... ... ... ... ... ... ... ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
## Source: local data frame [336,776 x 3] ## ## gain speed gain_per_hour ## (dbl) (dbl) (dbl) ## 1 9 370.0441 2.378855 ## 2 16 374.2731 4.229075 ## 3 31 408.3750 11.625000 ## 4 -17 516.7213 -5.573770 ## 5 -19 394.1379 -9.827586 ## 6 16 287.6000 6.400000 ## 7 24 404.4304 9.113924 ## 8 -11 259.2453 -12.452830 ## 9 -5 404.5714 -2.142857 ## 10 10 318.6957 4.347826 ## .. ... ... ...
## Warning: Removed 296 rows containing missing values (geom_point).
to split time between hours and minutes
## Source: local data frame [336,776 x 3] ## ## dep_time hour minute ## (int) (dbl) (dbl) ## 1 517 5 17 ## 2 533 5 33 ## 3 542 5 42 ## 4 544 5 44 ## 5 554 6 54 ## 6 554 5 54 ## 7 555 6 55 ## 8 557 6 57 ## 9 557 6 57 ## 10 558 6 58 ## .. ... ... ...
Let's compute the average delay per month of flights to Madison.
Normally–in aggregate, by or plyr's d*ply functions–you specify the grouping as an argument to the aggregation function.
## month dep_delay ## 1 1 18.07692 ## 2 2 20.11111 ## 3 3 41.87097 ## 4 4 29.40741 ## 5 5 21.54839 ## 6 6 29.83333 ## 7 7 11.13333 ## 8 8 19.06667 ## 9 9 15.97183 ## 10 10 19.26190 ## 11 11 15.96250 ## 12 12 42.68831
In dplyr, grouping has its own action.
Here, we group_by date. This data.frame is not tidy, but muti-grouping solve it. See the helpful reminder from tbl_df print method
## Source: local data frame [12 x 2] ## ## month mean_dep_delay ## (int) (dbl) ## 1 1 18.07692 ## 2 2 20.11111 ## 3 3 41.87097 ## 4 4 29.40741 ## 5 5 21.54839 ## 6 6 29.83333 ## 7 7 11.13333 ## 8 8 19.06667 ## 9 9 15.97183 ## 10 10 19.26190 ## 11 11 15.96250 ## 12 12 42.68831
Work per day
## Source: local data frame [336,776 x 19] ## Groups: year, month, day [365] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## (int) (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## .. ... ... ... ... ... ... ... ## Variables not shown: sched_arr_time (int), arr_delay (dbl), carrier (chr), ## flight (int), tailnum (chr), origin (chr), dest (chr), air_time (dbl), ## distance (dbl), hour (dbl), minute (dbl), time_hour (time)
Note that one level (rigth most) is removed from grouping.
Now we use summarise to compute (several) aggregate values within each group (per day). summarise returns one row per group.
## Source: local data frame [365 x 6] ## Groups: year, month [?] ## ## year month day flights avg_delay n_planes ## (int) (int) (int) (int) (dbl) (int) ## 1 2013 1 1 842 11.548926 649 ## 2 2013 1 2 943 13.858824 712 ## 3 2013 1 3 914 10.987832 689 ## 4 2013 1 4 915 8.951595 689 ## 5 2013 1 5 720 5.732218 578 ## 6 2013 1 6 832 7.148014 640 ## 7 2013 1 7 933 5.417204 682 ## 8 2013 1 8 899 2.553073 667 ## 9 2013 1 9 902 2.276477 666 ## 10 2013 1 10 932 2.844995 689 ## .. ... ... ... ... ... ...
In average, how many flights a single plane is doing per day?
plot the distribution, display the mean / median (geom_vline())
plot the average delay per day. Use tidyr:unite and as.Date
which day should be avoided?
## Source: local data frame [365 x 7] ## Groups: year, month [12] ## ## year month day flights avg_delay n_planes avg_flights ## (int) (int) (int) (int) (dbl) (int) (dbl) ## 1 2013 1 1 842 11.548926 649 1.297381 ## 2 2013 1 2 943 13.858824 712 1.324438 ## 3 2013 1 3 914 10.987832 689 1.326560 ## 4 2013 1 4 915 8.951595 689 1.328012 ## 5 2013 1 5 720 5.732218 578 1.245675 ## 6 2013 1 6 832 7.148014 640 1.300000 ## 7 2013 1 7 933 5.417204 682 1.368035 ## 8 2013 1 8 899 2.553073 667 1.347826 ## 9 2013 1 9 902 2.276477 666 1.354354 ## 10 2013 1 10 932 2.844995 689 1.352685 ## .. ... ... ... ... ... ... ...
## Warning: Stacking not well defined when ymin != 0
## Source: local data frame [365 x 4] ## Groups: year, month [12] ## ## year month day avg_delay ## (int) (int) (int) (dbl) ## 1 2013 1 31 28.65836 ## 2 2013 1 30 28.62344 ## 3 2013 1 16 24.61287 ## 4 2013 1 25 21.89853 ## 5 2013 1 13 19.87315 ## 6 2013 1 24 19.46542 ## 7 2013 1 28 15.13853 ## 8 2013 1 2 13.85882 ## 9 2013 1 22 12.49944 ## 10 2013 1 1 11.54893 ## .. ... ... ... ...
What's wrong?
Mind that arrange use grouping! (will change in version 0.4.5)
## Source: local data frame [365 x 4] ## ## year month day avg_delay ## (int) (int) (int) (dbl) ## 1 2013 3 8 83.53692 ## 2 2013 7 1 56.23383 ## 3 2013 9 2 53.02955 ## 4 2013 7 10 52.86070 ## 5 2013 12 5 52.32799 ## 6 2013 5 23 51.14472 ## 7 2013 9 12 49.95875 ## 8 2013 6 28 48.82778 ## 9 2013 6 24 47.15742 ## 10 2013 7 22 46.66705 ## .. ... ... ... ...
## Source: local data frame [101 x 3] ## ## dest mean n ## (chr) (dbl) (int) ## 1 CAE 41.76415 106 ## 2 TUL 33.65986 294 ## 3 OKC 30.61905 315 ## 4 JAC 28.09524 21 ## 5 TYS 24.06920 578 ## 6 MSN 20.19604 556 ## 7 RIC 20.11125 2346 ## 8 CAK 19.69834 842 ## 9 DSM 19.00574 523 ## 10 GRR 18.18956 728 ## .. ... ... ...
First get the GPS coordinate of airports using the data frame airports
## Source: local data frame [1,396 x 7] ## ## faa name lat lon alt tz ## (chr) (chr) (dbl) (dbl) (int) (dbl) ## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 ## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 ## 3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 ## 4 06N Randall Airport 41.43191 -74.39156 523 -5 ## 5 09J Jekyll Island Airport 31.07447 -81.42778 11 -4 ## 6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -4 ## 7 0G6 Williams County Airport 41.46731 -84.50678 730 -5 ## 8 0G7 Finger Lakes Regional Airport 42.88356 -76.78123 492 -5 ## 9 0P2 Shoestring Aviation Airfield 39.79482 -76.64719 1000 -5 ## 10 0S9 Jefferson County Intl 48.05381 -122.81064 108 -8 ## .. ... ... ... ... ... ... ## Variables not shown: dst (chr)
We could have used left_join but 4 rows with a 3-letters acronym have no correspondance in the airports data frame. inner_join narrow down the lines that are present in both data frames.

Of note: anti_join can select rows which identifiers are absent in the second data frame.
class: small-code
## ## # maps v3.1: updated 'world': all lakes moved to separate new # ## # 'lakes' database. Type '?world' or 'news(package="maps")'. #
tally is a shortcut for counting number of items per group.
## Source: local data frame [1,113 x 3] ## Groups: dest [?] ## ## dest month n ## (chr) (int) (int) ## 1 ABQ 4 9 ## 2 ABQ 5 31 ## 3 ABQ 6 30 ## 4 ABQ 7 31 ## 5 ABQ 8 31 ## 6 ABQ 9 30 ## 7 ABQ 10 31 ## 8 ABQ 11 30 ## 9 ABQ 12 31 ## 10 ACK 5 21 ## .. ... ... ...
could sum over by multiple call of tally
count does the grouping for you
## Source: local data frame [1,113 x 3] ## Groups: dest [?] ## ## dest month n ## (chr) (int) (int) ## 1 ABQ 4 9 ## 2 ABQ 5 31 ## 3 ABQ 6 30 ## 4 ABQ 7 31 ## 5 ABQ 8 31 ## 6 ABQ 9 30 ## 7 ABQ 10 31 ## 8 ABQ 11 30 ## 9 ABQ 12 31 ## 10 ACK 5 21 ## .. ... ... ...
## Source: local data frame [15 x 2] ## ## height weight ## (int) (int) ## 1 58 115 ## 2 59 117 ## 3 60 120 ## 4 61 123 ## 5 62 126 ## 6 63 129 ## 7 64 132 ## 8 65 135 ## 9 66 139 ## 10 67 142 ## 11 68 146 ## 12 69 150 ## 13 70 154 ## 14 71 159 ## 15 72 164
Using Rstudio, right bottom panel. Select the folder
When the folder is selected and its content displayed, tell R to use this location (alternative to setwd(/path/to/location)). Press the More menu and click on
Set As Working Directory
Using Rstudio, right top panel. Select directly your file
R is rather flexible and permissive with its syntax. Howwver, being more strict tends to ease the debugging process.
Overview
Switch from "wide" format to "long" format
See "tidy-data-playbook" by Garrett Grolemund
Example
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa
## [1] 150 5
## [1] 300 5
## Sepal.Length Sepal.Width Species petal value ## 1 5.1 3.5 setosa Petal.Length 1.4 ## 2 4.9 3.0 setosa Petal.Length 1.4 ## 3 4.7 3.2 setosa Petal.Length 1.3 ## 4 4.6 3.1 setosa Petal.Length 1.5 ## 5 5.0 3.6 setosa Petal.Length 1.4
Excellent ressource about R (in French) by Ewen Gallic
Hadley Ewen Simon David